Abstract

Collecting large labeled data sets is a labo- 
rious and expensive task, whose scaling up 
requires division of the labeling workload be- 
tween many teachers. When the number 
of classes is large, miscorrespondences be- 
tween the labels given by the different teach- 
ers are likely to occur, which, in the extreme 
case, may reach total inconsistency. In this 
study we describe how globally consistent la- 
bels can be obtained, despite the absence of 
teacher coordination, and discuss the possi- 
ble efficiency of this process in terms of hu- 
man labor. We define a notion of label effi- 
ciency, measuring the ratio between the num- 
ber of globally consistent labels obtained and 
the number of labels provided by distributed 
teachers. We show that the efficiency depends
critically on the ratio $\alpha$ between the
number of data instances seen by a single
teacher and the number of classes. We suggest
several algorithms for the distributed labeling
problem, and analyze their efficiency
as a function of $\alpha$. In addition, we provide an
upper bound on label efficiency for the case of
completely uncoordinated teachers, and show
that efficiency approaches 0 as the ratio between
the number of labels each teacher provides
and the number of classes drops (i.e.,
$\alpha \to 0$).



Preliminary work. Under review by the International Con- 
ference on Machine Learning (ICML). Do not distribute. 



1. Introduction 

As applications of machine learning mature, larger 
training sets are required both in terms of the number 
of training instances and the number of classes con- 
sidered. In recent years we have witnessed this trend 
for example in vision related tasks such as object class 
recognition or detection (Griffin et al., 2007; Evering- 
ham et al., 2007; Russell et al., 2005). Specifically 
for object class recognition, current data sets such as 
the Caltech-256 (Griffin et al., 2007) include tens of 
thousands of images from hundreds of classes. Col- 
lecting consistent data sets of this size is an intensive 
and expensive task. Scaling up naturally leads to a dis- 
tributed labeling scenario, in which labels are provided 
by a large number of weakly coordinated teachers. For 
example, in the LabelMe system (Russell et al., 2005)
the labels are contributed by dozens of researchers, 
while in the ESP game (von Ahn, 2006) labels are 
supplied by thousands of uncoordinated players. 

As we turn toward distributed labeling, several practi- 
cal considerations emerge which may disrupt the data 
integrity. In general, while it is reasonable to be- 
lieve that a single teacher is relatively self-consistent 
(though not completely error-free), this is not the case
with multiple uncoordinated teachers. Different teach- 
ers may have differences in their labeling systems due 
to several causes. First, different teachers may use 
different words to describe the same item class. For 
example, one teacher may use the word "truck" while 
the other uses "lorry" to describe the same class. Con- 
versely, the same word may be used by two teachers to 
describe two totally different classes, hence one teacher 
may use "greyhound" to describe the breed of dog while 
the other uses it to describe the C-2 navy aircraft. Sim- 
ilar problems occur when different teachers label the 
data with different abstraction levels, so one generalizes
over all dogs, while the other discriminates between
a poodle, a Labrador, etc. Finally, teachers
often do not agree on the exact demarcation of con- 
cepts, so a chair carved in stone may be labeled as 
a "chair" by one teacher, while the other describes it 
as "a rock". All these phenomena become increasingly 
pronounced as the number of classes is increased, thus 
their neglect essentially leads to a severe decrease in 
label purity and consequently in learning performance. 

In this paper we study the cost of obtaining glob- 
ally consistent labels, while focusing on a specific dis- 
tributed labeling scenario, in which only some of the 
difficulties described above are present. To enforce the 
distributed nature of the problem, we assume that a 
large data set with n examples is to be labeled by a set 
of uncoordinated teachers, where each teacher agrees 
to label at most $l \ll n$ data points. While there is
a one-to-one correspondence between the classes used 
by the different teachers, we assume that their label- 
ing systems are entirely uncoordinated, so a class la- 
beled as "duck" by one teacher may be labeled as a 
"goat" by another. In later stages of this paper, we 
relax this assumption, and consider a case in which 
partial consistency exists between the different teach- 
ers. Both scenarios are realistic in various problem 
domains. Consider for example a security system for 
which we have to label a large set of face images, in- 
cluding thousands of different people. Since teachers 
are not familiar with the persons to be labeled, the 
names they give to classes are entirely uncoordinated.
The case of a partial consistency is exemplified in dis- 
tributed labeling of flower images: the layman can eas- 
ily distinguish between many different kinds of flowers 
but can name only a few. 

The difficulties of "one-to-many" label correspondence 
between teachers and concept demarcation disagree- 
ments are not met by our current analysis, which fo- 
cuses on the preliminary difficulties of distributed la- 
beling. Another related scenario, to which our analysis 
can be extended relatively easily, is the case in which 
the initial data is labeled by uncoordinated teachers 
right from the start. Consider for example, the task 
of unifying images labeled in a site like Flickr 1 into 
a meaningful large training data set. Our suggested 
algorithms and analysis apply to this case with minor 
modifications. 

1.1. Relevant literature 

In the active learning framework (Cohn et al., 1990) 
and the experimental design framework (see e.g., 

1 http://www.flickr.com/



(Atkinson & Donev, 1992)), the goal is to minimize
the number of queries for labels (or experiments con- 
ducted) while learning a target concept. It has been 
shown (Freund et al., 1997) that a careful selection 
of queries can lead to an exponential reduction in the 
number of labels needed. This line of research is mo- 
tivated by the costly and cumbersome process of ob- 
taining labels for instances. We share this motivation 
but argue that the problem is not merely the quantity 
of labels but also the quality and the consistency of 
the labels that should be treated in the data collection 
process. 

The problem of quality of labels, i.e., learning with 
noise, has been addressed extensively in the machine 
learning literature (see e.g., (Decatur, 1995)). In this
line of work it is assumed that the teacher does not 
always provide the true instance labels. The sever- 
ity of noise ranges from adversarial noise, in which 
the teacher tries to prevent the learning process by 
providing inaccurate labels, to the more benign ran- 
dom classification noise. While the inconsistency be- 
tween uncoordinated teachers can be regarded as some 
form of label noise, it has unique characteristics and its 
treatment is hence different from the other sources of 
noise mentioned. Specifically, as long as each teacher is 
noise-free and self-consistent, we are able to eliminate 
the noise completely and achieve certain labels. 

The scenario of distributed labeling with uncoordi- 
nated teachers was considered in the "equivalence con- 
straints" framework (Bar-Hillel et al., 2005). When 
learning with equivalence constraints, the learner is 
presented with pairs of instances and the annotation 
suggests whether they share the same class or not. The 
authors conjectured that as the number of classes
increases, the labeling effort required to coordinate the
labels from different teachers becomes prohibitive. We 
prove this conjecture in Theorem 3. Alternatively, 
equivalence constraints can be used as a direct supervi- 
sion for the learning algorithm. Indeed, (Bar-Hillel & 
Weinshall, 2003) proved that a concept class is learn- 
able with equivalence constraints if it is learnable from 
labels, so this alternative has some appeal. 

1.2. The distributed labeling problem 

In the distributed labeling task we have to reveal the 
labels of $n$ instances $\{x_1, \ldots, x_n\}$. We assume that
there exist "true" labels $y_1, \ldots, y_n$ (with $y_j = y(x_j)$)
and the distributed labeling algorithm should return
$\tilde{y}_1, \ldots, \tilde{y}_n$ such that $\tilde{y}_i = \tilde{y}_j$ if and only if $y_i = y_j$. We
denote the number of classes by $c$, and assume that
each teacher is willing to label only $l = c\alpha$ instances
where $l, c \ll n$. Throughout this paper we assume that
the labels provided by teachers are consistent with the
true labels in the sense that for any teacher $t$ and any
pair of instances $x_i, x_j$

$$[t(x_i) = t(x_j)] \iff [y_i = y_j] . \qquad (1)$$

where t (x) is the label given by teacher t to instance 
x. However, apart from section 4, we assume no inter- 
teacher consistency with respect to class names, i.e., 
teachers may disagree on the names of the different 
classes. To measure the competence of different algo- 
rithms for combining the labels of the different teach- 
ers we define the following: 

Definition 1 Denote by $\sigma = \{x_i, y_i\}_{i=1}^{n}$ an input sequence
of $n$ points with the labels $y_i \in \{1, \ldots, c\}$. A
distributed labeling algorithm alg is $f(\alpha, \mathrm{alg})$ efficient
if

$$f(\alpha, \mathrm{alg}) = \lim_{c \to \infty} \lim_{n \to \infty} \frac{n}{\sup_{\sigma} \mathrm{labels}(\mathrm{alg}, \sigma, c\alpha)}$$

where $\mathrm{labels}(\mathrm{alg}, \sigma, l)$ is the average (over the internal
randomness of the algorithm) number of human-generated
labels the algorithm alg uses to label the sequence
$\sigma$, where each teacher is willing to label $l$
examples.

Clearly, if no structural assumptions are made on true
labels then $f(\alpha, \mathrm{alg})$ is bounded by 1 from above. We
denote by $f^*(\alpha)$ the optimal efficiency for a given $\alpha$,
i.e., $f^*(\alpha) = \sup_{\mathrm{alg}} f(\alpha, \mathrm{alg})$.

1.3. Main results 

In section 2 we present several algorithms for solving
the distributed labeling problem. The first algorithm
presented is the contract the connected components
($C^3$) algorithm. We show that this simple algorithm
has efficiency of $1 - (1 - \exp(-\alpha))/\alpha$. We then
improve this algorithm with the representatives algorithm
and prove its efficiency to be better than the
efficiency of the previous algorithm. In section 3 we
present an upper bound on the achievable efficiency.
We show that $f^*(\alpha) \le \min(2\alpha/(1+\alpha), 1)$. In section 4
we study a relaxed version of the distributed labeling
problem in which there exists some consistency between
the different teachers. Thus, with some probability
$p$ two teachers will agree on the name of a
given class. In this setting, we present a revised version
of the $C^3$ algorithm and show its efficiency to be

$$1 - \frac{1 - \exp(-\alpha)}{\alpha - \exp(-\alpha) + \exp(-\alpha(1-p))} .$$



Algorithm 1 The Contract the Connected Components ($C^3$) algorithm
input: $n$ unlabeled instances $x_1, \ldots, x_n$
output: a partition of $x_1, \ldots, x_n$ into classes according
to the true labels

1. Let $G$ be the edge-free graph whose vertices are
$x_1, \ldots, x_n$.

2. While $G$ is not a clique

(a) pick $l$ random nodes $U = \{x_{i_1}, \ldots, x_{i_l}\}$ from $G$
which do not form a clique.

(b) send $U$ to a teacher and receive $y_{i_1}, \ldots, y_{i_l}$.

(c) for every $1 \le r < s \le l$ do

i. if $y_{i_r} = y_{i_s}$ then contract the vertices $x_{i_r}$
and $x_{i_s}$ in the graph $G$.

ii. if $y_{i_r} \ne y_{i_s}$ then add the edge $(x_{i_r}, x_{i_s})$ to
the graph $G$.

3. Mark each vertex in $G$ with a unique number from
$[1 \ldots c]$.

4. For every vertex in $G$, propagate its label to all
the nodes that were contracted into this vertex.



2. Label-efficient algorithms 

As described in 1.2, we assume in this section that the 
name each teacher assigns to a class is meaningless. 
Therefore, the best we can hope for is to break the 
n instances into c classes such that any pair of points 
share the same class label if and only if all teachers give 
these two points the same label. In this section we sug- 
gest two algorithms for this task. The bounds obtained 
for these algorithms are presented in Figure 1. 

2.1. The Contract the Connected Components
($C^3$) algorithm

The first algorithm we consider is the Contract the
Connected Components ($C^3$) algorithm presented in
Algorithm 1. The idea behind this algorithm is to 
build a graph whose nodes are sets of equivalent in- 
stances. Whenever we find that two nodes share the 
same label, we contract them into a single node. On 
the other hand, whenever we find that two nodes do 
not share the same label, we generate an edge between 
them. The algorithm ends when the remaining graph 
is a clique. At this point, each of the nodes is assigned 
with a unique label. These labels propagate to all the 
points to be labeled, since each point is associated with 
a single node in the clique. 
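The contraction loop described above can be simulated in a few lines. The following is a minimal sketch, not the authors' implementation: teachers are simulated by a private random renaming of the true classes (so they satisfy (1) but share no names), and, as a simplifying assumption, each batch is seeded with one pair of vertices not yet known to differ rather than drawn uniformly among all non-clique sets. The names `c3_partition`, `members`, and `differ` are hypothetical.

```python
import random


def c3_partition(true_labels, l, seed=0):
    """Simulate C^3; return (partition ids, number of teacher labels used)."""
    rng = random.Random(seed)
    classes = sorted(set(true_labels))
    n = len(true_labels)
    members = {v: [v] for v in range(n)}   # instances contracted into vertex v
    differ = {v: set() for v in range(n)}  # edges: vertices known to differ
    labels_used = 0

    def non_adjacent_pair():
        # G is a clique exactly when no such pair exists.
        for u in differ:
            if len(differ[u]) < len(differ) - 1:
                for v in differ:
                    if v != u and v not in differ[u]:
                        return u, v
        return None

    while (pair := non_adjacent_pair()) is not None:
        others = [v for v in differ if v not in pair]
        batch = list(pair) + rng.sample(others, min(l - 2, len(others)))
        # Simulated teacher: self-consistent, but uses private class names.
        names = classes[:]
        rng.shuffle(names)
        rename = dict(zip(classes, names))
        labels_used += len(batch)
        groups = {}
        for v in batch:
            groups.setdefault(rename[true_labels[v]], []).append(v)
        roots = []
        for group in groups.values():
            root = group[0]
            for v in group[1:]:            # same answer: contract v into root
                members[root].extend(members.pop(v))
                for w in differ.pop(v):
                    differ[w].discard(v)
                    differ[w].add(root)
                    differ[root].add(w)
            roots.append(root)
        for k, u in enumerate(roots):      # different answers: add edges
            for v in roots[k + 1:]:
                differ[u].add(v)
                differ[v].add(u)

    partition = [None] * n
    for class_id, instances in enumerate(members.values()):
        for x in instances:
            partition[x] = class_id
    return partition, labels_used
```

Dividing the number of instances by the returned label count gives an empirical efficiency for a single run; for finite $n$ and $c$ it can only roughly track the asymptotic quantity analyzed below.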

The correctness of the algorithm is straightforward due 
to the self-consistency of the teachers. In Theorem 1
we show the label efficiency of the $C^3$ algorithm to be
$1 - (1 - \exp(-\alpha))/\alpha$ where $\alpha = l/c$. The main idea
behind the analysis is to study the expected number 
of contractions in each iteration. 

Theorem 1 The label efficiency of the $C^3$ algorithm
is lower-bounded by

$$1 - \frac{1}{\alpha}\left(1 - \exp(-\alpha)\right) .$$

Before proving the theorem, we present a lemma in 
which the contraction rate associated with a single 
teacher is bounded. 

Lemma 1 Assume a teacher labels $l$ random examples
($l \to \infty$) from $c = l/\alpha$ different classes. The expected
number of unique labels that the teacher will give to
the $l$ instances is at most $l$ times $Q(\alpha)$ where

$$Q(\alpha) = \frac{1}{\alpha}\left(1 - \exp(-\alpha)\right) .$$

Note that the number of unique labels is exactly the
number of nodes that will be left after contracting the
$l$ instances.

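Proof: Assume that the probability for seeing class $i$
is $p_i$. The result follows from the following:

$$\begin{aligned}
E[\text{number of unique labels}]
&= c - E[\text{number of labels not seen}] \\
&= c - \sum_{i=1}^{c} (1 - p_i)^l \qquad (2) \\
&\le c - c\left(1 - \frac{1}{c}\right)^l \\
&= c\left(1 - \exp(-\alpha)\right) \qquad (3) \\
&= l \cdot \frac{1}{\alpha}\left(1 - \exp(-\alpha)\right) .
\end{aligned}$$

The inequality holds because, by convexity, the sum in (2)
is smallest when all classes are equally likely.
The correctness of (3) follows since we are assuming
that $l, c \to \infty$ while $\alpha = l/c$ is constant. □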
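Lemma 1 can be checked empirically. The snippet below is a small Monte Carlo sketch under the assumption of uniformly distributed classes (the case in which the bound is tight); the function name `mean_unique_labels` and the chosen sizes are illustrative only. For finite $c$ the estimate sits slightly above $l\,Q(\alpha)$, since the lemma is an asymptotic statement.

```python
import math
import random


def Q(alpha):
    """Q(alpha) = (1 - exp(-alpha)) / alpha, as defined in Lemma 1."""
    return (1.0 - math.exp(-alpha)) / alpha


def mean_unique_labels(c, l, trials=200, seed=0):
    """Average number of distinct classes among l uniform draws from c classes."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += len({rng.randrange(c) for _ in range(l)})
    return total / trials


if __name__ == "__main__":
    c, l = 1000, 500                      # alpha = l / c = 0.5
    print("simulated:", mean_unique_labels(c, l))
    print("l * Q(alpha):", l * Q(l / c))
```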

Proof: (of Theorem 1) At each round of the $C^3$ algorithm,
$l$ elements are sent to be labeled by a teacher.
From Lemma 1 we have that the number of remaining
elements is on average at most $l\,Q(\alpha)$.

Therefore, the expected number of rounds the algorithm
will make until finished is

$$\frac{n}{l\left(1 - \frac{1}{\alpha}\left(1 - \exp(-\alpha)\right)\right)} .$$

Note that the number in the denominator is the expected
number of removed elements at each round.

Thus, the number of labels used is

$$\frac{n}{1 - \frac{1}{\alpha}\left(1 - \exp(-\alpha)\right)} .$$

Plugging this number into the definition of label efficiency
gives the desired result. □
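As a worked illustration of the bound in Theorem 1 (an expansion added here for orientation, not taken from the original analysis), expanding the exponential shows how the efficiency behaves at the two extremes of $\alpha$:

```latex
\begin{align*}
1 - \frac{1}{\alpha}\left(1 - e^{-\alpha}\right)
  &= 1 - \frac{1}{\alpha}\left(\alpha - \frac{\alpha^2}{2} + \frac{\alpha^3}{6} - \cdots\right)
   = \frac{\alpha}{2} - \frac{\alpha^2}{6} + O(\alpha^3)
  && (\alpha \to 0), \\
1 - \frac{1}{\alpha}\left(1 - e^{-\alpha}\right)
  &= 1 - \frac{1}{\alpha} + \frac{e^{-\alpha}}{\alpha}
  && (\alpha \to \infty).
\end{align*}
```

So the bound grows linearly, with slope $1/2$, when each teacher sees far fewer instances than there are classes, and it approaches 1 when each teacher sees many instances per class.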

2.2. The representatives algorithm 

Each teacher provides us with two types of informa- 
tion sources. One is positive equivalence constraints, 
i.e., the knowledge that two instances share the same 
label. The other is negative equivalence constraints, 
i.e., the knowledge that two instances do not share the 
same label. While the $C^3$ algorithm is very effective
in using positive equivalence constraints, it makes very 
little use of negative equivalence constraints. The rep- 
resentatives algorithm (Algorithm 2) tries to exploit 
this type of information as well. The main idea be- 
hind this algorithm is first to find all the points that 
belong to certain classes. Once we know that the re- 
maining points do not belong to any of these classes, 
we are left with a problem with fewer instances and 
fewer potential classes and thus an "easier one". 

In order to detect all the points belonging to a certain 
class we use representatives. A representatives set is 
a set of $c$ instances $\{x_{i_1}, \ldots, x_{i_c}\}$ such that for each class
there is exactly one member (representative) of the 
class in the representatives set. Finding a represen- 
tatives set is a simple task and can be done without 
affecting the overall efficiency, since its label complex- 
ity does not depend on n. Therefore, for the sake 
of simplicity we assume that the representatives set 
is given in advance. We further assume that we know 
the probability of each representative class. This infor- 
mation too can be easily estimated from data without 
jeopardizing efficiency. 

$\beta$ is the proportion of representatives in the $l$ instances
each teacher labels. Note that when $\beta = 0$, the representatives
algorithm is essentially the same as the $C^3$
algorithm. However, when $\beta > 0$, we use the fact
that after all the points were compared against a certain
representative, we are guaranteed to have found
all the points with the same label as this representative,
and thus we can eliminate this class.

Theorem 2 The label efficiency of the representative 
algorithm is lower-bounded by 

$$\frac{(1-\beta)(1-q)^2}{1 - q - \frac{q}{r}\left(1 - q^r\right)}$$
where $r = c/(\beta l) = 1/(\alpha\beta)$ is the number of sets in the
Algorithm 2 The Representatives Algorithm
Inputs:

• $n$ unlabeled instances $x_1, \ldots, x_n$

• a set $a_1, \ldots, a_c$ of representatives such that
$a_i \in \{x_1, \ldots, x_n\}$

• a list of probabilities $p_1, \ldots, p_c$ such that $p_i$ is the
probability of seeing an instance from the class of
$a_i$.

Outputs: a partition of the $n$ points into $c$ label classes

1. Reorder the representatives and the $p_i$'s such that
$p_1 \ge p_2 \ge \cdots \ge p_c$.

2. Let* $\beta \in (0, 1)$.

3. Partition the set of representatives into $r$
sets $S_0, \ldots, S_{r-1}$ such that
$S_i = \{a_{i\beta l + 1}, \ldots, a_{(i+1)\beta l}\}$.

4. Let $G$ be the edge-free graph whose vertices are
$x_1, \ldots, x_n$.

5. While $G$ is not empty

(a) For $i = 0, \ldots, r-1$

i. Partition the remaining points in the
graph into sets of size $(1-\beta)l$.

ii. For each subset of $(1-\beta)l$ points:

A. send these points together with $S_i$ to a
teacher.

B. contract the graph according to the labels
returned by the teacher.

iii. For every $a_j \in S_i$

A. label $a_j$ with the label $j$, and propagate
this label.

B. remove $a_j$ from $G$.

* Choose $\beta$ to optimize the bound in Theorem 2.



partition of the representatives into sets of size $\beta l$ and$^2$
$q = Q(\alpha(1-\beta)) = \frac{1 - \exp(-\alpha(1-\beta))}{\alpha(1-\beta)}$.

Proof: In each round of step 5a we break $G$ into
$|G| / (l(1-\beta))$ parts and thus use $|G| / (1-\beta)$ labels.
Therefore, we need only to estimate the size of $G$ after
each round. Denote the number of vertices in $G$ at the
beginning of round $i$ by $g_i$. In order to bound $g_i$ we
should consider how it is affected by two ingredients:
first, the contraction, which happens in the same fashion
as in the $C^3$ algorithm, and second, the complete
elimination of classes $1, \ldots, i\beta l$.

We use Lemma 1 to analyze the contraction rate. Each
teacher sees $l(1-\beta)$ instances which are not representatives
of some classes. These instances come from $c - i\beta l$
different classes and thus, from Lemma 1, the contraction
rate is

$$Q\left(\frac{l(1-\beta)}{c - i\beta l}\right) = Q\left(\frac{\alpha(1-\beta)}{1 - i\alpha\beta}\right) .$$

Out of the remaining points, all the points which are
being represented in $S_i$ are eliminated. Due to the
reordering of the $p_i$'s, these points are at least a fraction
of $1/(r-i)$ of the remaining points. Thus

$$g_{i+1} \le \frac{r-(i+1)}{r-i}\, Q\left(\frac{\alpha(1-\beta)}{1 - i\alpha\beta}\right) g_i .$$



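The number of labels used in all the rounds is therefore

$$\begin{aligned}
\sum_{i=0}^{r-1} \frac{g_i}{1-\beta}
&\le \frac{n}{1-\beta} \sum_{i=0}^{r-1} \left(1 - \frac{i}{r}\right) \prod_{j=0}^{i-1} Q\left(\frac{\alpha(1-\beta)}{1 - j\alpha\beta}\right) \qquad (4) \\
&\le \frac{n}{1-\beta} \sum_{i=0}^{r-1} \left(1 - \frac{i}{r}\right) q^i \\
&= \frac{n\left(1 - q - \frac{q}{r}\left(1 - q^r\right)\right)}{(1-\beta)(1-q)^2}
\end{aligned}$$

where the second inequality is due to the monotonicity of the $Q$
function. Using the last expression in the efficiency definition
completes the proof. □

2 The $Q$ function is defined in Lemma 1.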

The expression obtained in Theorem 2 can be computed
numerically for any value of $\alpha, \beta$ and so it can be used
to optimize $\beta$ for a given $\alpha$. When the optimal $\beta$ is
used, the representatives algorithm outperforms the $C^3$
algorithm as seen in Figure 1.
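The optimization over $\beta$ can be sketched numerically. The snippet below is an illustration under stated assumptions, not the authors' code: it uses the per-round picture from the proof (after $i$ rounds a fraction $1 - i/r$ of the classes remains, and each round contracts the graph by a factor $Q(\alpha(1-\beta)/(1 - i\alpha\beta))$), sums the labels used over the $r$ rounds, and scans a grid of $\beta$ values. The function names and the grid are hypothetical, $r$ is rounded to an integer, and $\alpha\beta$ is replaced by $1/r$ inside $Q$.

```python
import math


def Q(x):
    """Q(x) = (1 - exp(-x)) / x, as defined in Lemma 1."""
    return (1.0 - math.exp(-x)) / x


def representatives_efficiency(alpha, beta):
    """Efficiency n / (labels used), with labels summed round by round."""
    r = max(1, round(1.0 / (alpha * beta)))   # number of representative sets
    labels_per_instance = 0.0
    surviving = 1.0                            # product of contraction factors
    for i in range(r):
        labels_per_instance += (1.0 - i / r) * surviving / (1.0 - beta)
        surviving *= Q(alpha * (1.0 - beta) / (1.0 - i / r))
    return 1.0 / labels_per_instance


def best_beta(alpha, grid=None):
    """Grid search for the beta maximizing the bound (the grid is an assumption)."""
    grid = grid or [k / 100 for k in range(5, 96, 5)]
    return max(grid, key=lambda b: representatives_efficiency(alpha, b))
```

For small $\alpha$ this sketch gives an efficiency close to $(2/3)\alpha$ at $\beta = 1/3$, in line with the value quoted in the concluding section.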

3. The optimal efficiency 

In the previous section we studied the efficiency of sev- 
eral algorithms. In the current section we study the 
efficiency of the optimal algorithm. That is, we study 
the function

$$f^*(\alpha) = \sup_{\mathrm{alg}} f(\alpha, \mathrm{alg}) .$$

We give an upper bound on $f^*(\alpha)$ showing that
algorithms cannot have an efficiency greater than
$\min(1, 2\alpha/(1+\alpha))$. This bound asserts that the labeling
problem is not trivial in the sense that it is not always
possible to achieve efficiency 1. Moreover, the problem
becomes hard in the limit of $\alpha \to 0$, as the efficiency
drops linearly with $\alpha$ in this region. Comparing the
bound shown here and the efficiency of the algorithms 
presented in previous sections, one can see that there 
is still a significant gap between the achieved and the 
(maybe) achievable. 

Theorem 3 Let $f^*(\alpha)$ be the best achievable efficiency
for a given $\alpha$. Then

$$f^*(\alpha) \le \min\left(1, \frac{2\alpha}{1+\alpha}\right) .$$

Proof: Fix $n$ and $c$ and assume $l = \alpha c$. If $\alpha \ge 1$ then
the required bound is trivial since efficiency cannot exceed 1.
Therefore, we are only interested in the cases
where $\alpha < 1$. Let alg be a distributed labeling algorithm.
For each of the $n$ instances we choose a class
label uniformly and independently from the c possible 
labels. We analyze the expected number of teacher 
calls needed before the class assignments are found. 

Fix an instance x, we first analyze the expected num- 
ber of teacher calls (in which x participates) before it 
is first contracted with some other point. Assume that 
x has i edges in the graph G, i.e., there are i instances 
for which it is known that x does not share its label. If 
x' is a different point than x, the probability that they 
share the same label is at most $1/(c-i)$. To see this,
note that for any legal label assignment to $G \setminus \{x\}$,
there are at least $c - i$ uplifts of this assignment to $G$.

Let $P(i)$ be the probability that $x$ is contracted at
least once during its first $i$ comparisons to other instances.
We claim that $P(i) \le i/c$ for all $1 \le i < c$.
Clearly, $P(0) = 0$. The proof is by induction. For
$i = 1$, clearly the probability for contraction with the
first point $x$ is compared against is $1/c$. Note that

$$\begin{aligned}
P(i+1) &= P(i) + \left(1 - P(i)\right) \Pr[\text{contract at step } i+1] \\
&\le P(i) + \left(1 - P(i)\right) \frac{1}{c-i} \\
&\le \frac{i}{c}\left(1 - \frac{1}{c-i}\right) + \frac{1}{c-i} \\
&= \frac{i+1}{c} .
\end{aligned}$$

In the previous calculation, we assumed that x is com- 
pared to other points one at a time. However, the 
teachers label $l$ instances at a time, thus whenever $x$ is
sent to a teacher, it is compared against $l - 1$ points.
Note that an instance keeps being sent to teachers at 
least until it is first unified. Therefore, the number of 
teachers that will have to label x until its label is dis- 
covered, is at least the total number of teachers that 
will have to label x until it is unified at least once with 
another instance. From this we obtain the following 
lower bound for the expected number of teachers that 
see x: 

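$$\begin{aligned}
E[\text{number of teachers that see } x]
&= \sum_{j \ge 1} \Pr[\text{number of teachers} \ge j] \\
&= \sum_{j \ge 1} \left(1 - \Pr[\text{number of teachers} < j]\right) \\
&\ge \sum_{j=1}^{(c-1)/(l-1)} \left(1 - P\left((j-1)(l-1)\right)\right) \\
&\ge \sum_{j=1}^{(c-1)/(l-1)} \left(1 - \frac{(j-1)(l-1)}{c}\right) \\
&= \frac{c-1}{l-1} - \frac{1}{2} \cdot \frac{c-1}{c} \left(\frac{c-1}{l-1} - 1\right) .
\end{aligned}$$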

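The efficiency can be derived from this term, using $l = \alpha c$
so that $(c-1)/(l-1) \to 1/\alpha$ as $c \to \infty$:

$$\begin{aligned}
f^*(\alpha)
&\le 1 \Big/ \lim_{c \to \infty} \left[\frac{c-1}{l-1} - \frac{1}{2} \cdot \frac{c-1}{c}\left(\frac{c-1}{l-1} - 1\right)\right] \\
&= 1 \Big/ \left[\frac{1}{\alpha} - \frac{1}{2}\left(\frac{1}{\alpha} - 1\right)\right] \\
&= \frac{2\alpha}{1+\alpha} .
\end{aligned}$$

□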
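The limit in the last step can be checked against the finite sum from the proof. The snippet below is a small numerical sketch with hypothetical function names; it evaluates the lower bound on the expected number of teachers per instance for concrete $c$ and $l$, and compares its reciprocal with $2\alpha/(1+\alpha)$. The upper limit of the sum is rounded down to an integer, so the two agree closely only when $1/\alpha$ is (nearly) an integer.

```python
def teachers_lower_bound(c, l):
    """Sum over j of 1 - (j-1)(l-1)/c, for j = 1 .. floor((c-1)/(l-1))."""
    m = (c - 1) // (l - 1)
    return sum(1.0 - (j - 1) * (l - 1) / c for j in range(1, m + 1))


def efficiency_upper_bound(alpha):
    """The bound of Theorem 3: min(1, 2*alpha / (1 + alpha))."""
    return min(1.0, 2.0 * alpha / (1.0 + alpha))


if __name__ == "__main__":
    c, l = 10**6, 10**4                   # alpha = 0.01
    print(1.0 / teachers_lower_bound(c, l), efficiency_upper_bound(l / c))
```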



4. Learning with name-consistent 
teachers 

In previous sections we assumed that class names 
used by different teachers are totally uncoordinated, 
so naming conventions of one teacher are meaningless 
to the other. While this scenario may occur (like in 
the 'face labeling' task mentioned in the introduction), 
in most cases this assumption is too pessimistic. It is 
more reasonable to assume that some level of agreement
regarding class names exists, though this agreement
is partial and not perfect. In this section we
assume that there exists $0 \le p \le 1$ such that with
probability p over the choice of a random teacher t and 
class j, the teacher uses the true global class name j 
as the class label: 
$$\forall x \quad \Pr\left(t(x) = y(x)\right) \ge p . \qquad (5)$$

Figure 1. The efficiency (Y-axis) of the $C^3$ algorithm and
the anchor algorithm are plotted together with the bound
on the optimal efficiency (Theorem 3) for different values
of $\alpha$ (X-axis).

We assume some sort of a probability measure over the 
teachers and the classes. If the pool of teachers is fi- 
nite, it can be the uniform distribution, and otherwise 
we assume that whenever we need another teacher to 
label some instances, the teacher will be such that (5) 
is true. Notice that we also keep our previous assump- 
tion that all the teachers are class consistent in the 
sense of (1). 

When p = 1 the assumption (5) means that all the 
teachers use the same global naming system, i.e.,
$t(x_j) = y_j$ for all $t, j$. In this case the labeling problem
is trivial, and it is easy to obtain label efficiency
of 1 simply by splitting the instances between differ- 
ent teachers. On the other hand, when p is very small, 
there is no name consistency and the situation boils 
down to the scenario studied in Section 2. Therefore, 
we will now focus on studying name consistency in the 
general case when $p \in (0, 1)$.

The algorithm we present to address this situation is
the Consistently Contract the Connected Components
($C^4$) algorithm (Algorithm 3). The difference between the $C^4$
algorithm and the $C^3$ algorithm is that the $C^4$ algorithm
sends teachers instances that were previously given the
same label by some other teachers.

The $C^4$ algorithm differs from the $C^3$ algorithm in using
the labels for selecting better candidates for sending
to the same teacher. However, note that we still
declare the equivalence of two instances only when a
single teacher labels both with the same label. Therefore,
due to the class consistency (1) the correctness of
the algorithm is guaranteed. We now turn to proving
its efficiency.

Algorithm 3 The Consistently Contract the
Connected Components ($C^4$) algorithm
Input: $n$ unlabeled instances $x_1, \ldots, x_n$
Output: a partition of $x_1, \ldots, x_n$ into classes according
to the true labels

1. Let $G$ be the edge-free graph whose vertices are
$x_1, \ldots, x_n$.

2. Label each vertex with 0.

3. While $G$ is not a clique

(a) pick $l$ random nodes $U = \{x_{i_1}, \ldots, x_{i_l}\}$ from
$G$ such that all these nodes have the same
label.

(b) send $U$ to a teacher and receive $y_{i_1}, \ldots, y_{i_l}$.

(c) for every $1 \le r \le l$, label $x_{i_r}$ with the label
$y_{i_r}$.

(d) for every $1 \le r < s \le l$ do

i. if $y_{i_r} = y_{i_s}$ then contract the vertices $x_{i_r}$
and $x_{i_s}$ in the graph $G$.

ii. if $y_{i_r} \ne y_{i_s}$ then add the edge $(x_{i_r}, x_{i_s})$ to
the graph $G$.

4. Mark each vertex in $G$ with a unique number.

5. For every vertex in $G$ propagate its label to all
the nodes that were contracted into this vertex.
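The only step of Algorithm 3 that differs materially from Algorithm 1 is the batch selection in step 3(a). A minimal sketch of that step follows, assuming the current labels are kept in a dictionary from vertex to label; the function name `pick_same_label_batch` and the data layout are hypothetical. Drawing a random vertex first means a label group is chosen with probability proportional to its size, which is one possible reading of "random" here. The sketch does not check whether the chosen batch already forms a clique.

```python
import random
from collections import defaultdict


def pick_same_label_batch(current_label, l, rng):
    """Pick up to l vertices of G that currently carry the same label."""
    groups = defaultdict(list)
    for vertex, label in current_label.items():
        groups[label].append(vertex)
    # A random vertex selects its label group with probability
    # proportional to the size of that group.
    anchor = rng.choice(list(current_label))
    candidates = groups[current_label[anchor]]
    return rng.sample(candidates, min(l, len(candidates)))


if __name__ == "__main__":
    labels = {0: 0, 1: 0, 2: "cat", 3: "cat", 4: "cat", 5: "dog"}
    print(pick_same_label_batch(labels, 2, random.Random(0)))
```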

Theorem 4 The label efficiency of the $C^4$ algorithm
is lower-bounded by

$$1 - \frac{1 - \exp(-\alpha)}{\alpha - \exp(-\alpha) + \exp(-\alpha(1-p))} .$$

Proof: Following the proof of the efficiency of the $C^3$
algorithm, we compute the rate at which the size of $G$
reduces. However, we need to consider two settings.
The first applies to teachers that label points for the
first time. The second case to consider is teachers who
label points that were previously labeled by some other
teacher. While these cases may be interleaved in time
according to algorithm $C^4$, w.l.o.g. we may analyze
them as if they occur in two consecutive phases.

Following Lemma 1, teachers who label points that
were not previously labeled will leave for further processing
$l\,Q(\alpha)$ points out of every $l$ labeled points. Thus
the first phase of labeling will require $n$ labels and will
leave $n\,Q(\alpha)$ points in the graph $G$.

In the second phase, each teacher is fed with points
that received the same label by different teachers. Due
to the name consistency (5), out of $l$ points that a
teacher labels we expect $pl$ of them to have the same
label. The other points
are subject to contraction. From Lemma 1 and the
above argument we expect that from every $l$ points
only $1 + (1-p)\,l\,Q(\alpha(1-p))$ will remain. The number
of labels used by teachers labeling previously labeled
points is

$$\frac{n\,Q(\alpha)}{1 - (1-p)\,Q(\alpha(1-p)) - \frac{1}{l}} .$$

Thus, the overall number of labels used is

$$n\left(1 + \frac{Q(\alpha)}{1 - (1-p)\,Q(\alpha(1-p)) - \frac{1}{l}}\right)$$

which leads to the efficiency of

$$\lim_{l \to \infty} \frac{1 - (1-p)\,Q(\alpha(1-p)) - \frac{1}{l}}{Q(\alpha) + 1 - (1-p)\,Q(\alpha(1-p)) - \frac{1}{l}}
= 1 - \frac{1 - \exp(-\alpha)}{\alpha - \exp(-\alpha) + \exp(-\alpha(1-p))} .$$

□



One can easily verify that if $p = 0$ the label efficiency
of the $C^4$ algorithm is identical to that of the $C^3$ algorithm.
However, the difference between the $C^3$ algorithm
and the $C^4$ algorithm is profound when $p \to 1$ and
$\alpha \to 0$. In this setting, the $C^3$ algorithm has efficiency
of $(\alpha/2) + o(\alpha)$ while the $C^4$ algorithm is $(1/2) - o(1)$
efficient.
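These limiting cases are easy to confirm numerically. The sketch below assumes the two-phase accounting from the proof of Theorem 4 (one label per instance in the first phase, then $Q(\alpha)$ remaining points per instance reduced at rate $1 - (1-p)Q(\alpha(1-p))$ in the second) and compares it with the closed form $1 - (1 - e^{-\alpha})/(\alpha - e^{-\alpha} + e^{-\alpha(1-p)})$. The function names are hypothetical, and $Q(0)$ is taken to be its limit value 1.

```python
import math


def Q(x):
    """Q(x) = (1 - exp(-x)) / x, with Q(0) taken as its limit 1."""
    return -math.expm1(-x) / x if x > 0 else 1.0


def c3_efficiency(alpha):
    """Lower bound of Theorem 1."""
    return 1.0 - Q(alpha)


def c4_efficiency(alpha, p):
    """Two-phase accounting from the proof of Theorem 4, in the limit l -> infinity."""
    removal_rate = 1.0 - (1.0 - p) * Q(alpha * (1.0 - p))
    return removal_rate / (Q(alpha) + removal_rate)


def c4_closed_form(alpha, p):
    """Closed form of the same quantity."""
    denom = alpha - math.exp(-alpha) + math.exp(-alpha * (1.0 - p))
    return 1.0 - (1.0 - math.exp(-alpha)) / denom


if __name__ == "__main__":
    for alpha in (0.01, 0.1, 1.0):
        print(alpha, c3_efficiency(alpha), c4_efficiency(alpha, 0.9))
```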

Note that despite the remarkable improvement, when
p = 1 there exists complete name consistency and thus 
it is trivially possible to achieve the perfect efficiency 
of 1. However, it is not clear if it is possible to get 
efficiency close to 1 if p is slightly less than 1. This 
remains as an open problem. 

5. Conclusions and further research 

In this work we have studied the problem of generat- 
ing consistent labels for a large data set given that the 
labels are provided by restricted teachers. We have 
focused on the problems arising when the labels used 
by different teachers are uncoordinated, but nevertheless
a one-to-one (unknown) correspondence exists
between their labeling systems. In this framework, 
we provided several algorithms and analyzed their ef- 
ficiency. We also presented an upper bound which 
shows that the problem is non-trivial, and becomes 
hard as the number of classes grows. In the limit $\alpha \to 0$
we characterize the achievable efficiency to be in the
range$^3$ $[(2/3)\alpha, 2\alpha]$; however, the exact value remains
an open problem.

We believe that the process of collecting data for large 
scale learning deserves much attention. One interest- 
ing extension of this work is to the case where the sym- 
metry between teachers is broken, either by consider- 
ing different noise levels to their labels, or more gener- 
ally, by also allowing the noise level to change between 
the different classes. In such scenarios, a 'teacher se- 
lection' problem arises as the identity of the teacher 
can be very informative. One example is the problem 
of "provost-selection" in which most of the teachers are 
useless novices in some domain-specific issues and thus 
it is essential to first find the experts ("provosts") and 
use only the labels they provide. A related problem 
arises when all teachers are useful, but they differ in 
their discrimination resolutions, so one teacher may 
say that an image contains a bird while the other may 
describe the exact bird species. Such problems are left 
for further research. 
3 The representatives algorithm achieves efficiency of $(2/3)\alpha$
with $\beta = 1/3$ and $\alpha \to 0$. To see this, plug these values in
(4).



